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A performance analysis of PIM, stream processing, and tiled processing on memory-intensive signal processing kernels

Jinwoo Suh, Eun-Gyu Kim, Stephen P. Crago, Lakshmi Srinivasan, Matthew C. French 
May 2003 ACM SIGARCH Computer Architecture News, Proceedings of the 30th
annual international symposium on Computer architecture ISCA '03, volume
31 Issue 2

Publisher: ACM Press 

Full text available: pdf (239.50 KB) Additional Information: full citation, abstract, references

Trends in microprocessors of increasing die size and clock speed and decreasing feature 
sizes have fueled rapidly increasing performance. However, the limited improvements in
DRAM latency and bandwidth and diminishing returns of increasing superscalar ILP and
cache sizes have led to the proposal of new microprocessor architectures that implement
processor-in-memory, stream processing, and tiled processing. Each architecture is
typically evaluated separately and compared to a baseline architectu ... 

PSCP: a scalable parallel ASIP architecture for reactive systems
A. Pyttel, A. Sedlmeier, C. Veith

February 1998 Proceedings of the conference on Design, automation and test in 
Europe 

Publisher: IEEE Computer Society 

Full text available: pdf (208 KB) Additional Information: full citation, abstract, references, citings, index terms

Publisher Site

We describe a Codesign approach based on a parallel and scalable ASIP architecture, 
which is suitable for the implementation of reactive systems. The specification language of 
our approach is extended statecharts. Our ASIP architecture is scalable with respect to the 
number of processing elements as well as parameters such as bus widths and register file 
sizes. Instruction sets are generated from a library of components covering a spectrum of 
space/time trade-off alternatives. Our approach featu ... 

Keywords: FPGA, application-specific, statechart, modular 
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Ingrid Verbauwhede, Patrick Schaumont 

April 2004 Proceedings of the 1st conference on Computing frontiers 
Publisher: ACM Press 

Full text available: pdf (398.28 KB) Additional Information: full citation, abstract, references, index terms

New applications and standards are first conceived only for functional correctness and 
without concerns for the target architecture. The next challenge is to map them onto an 
architecture. Embedding such applications in a portable, low-energy context is the art of 
molding it onto an energy-efficient target architecture combined with an energy efficient
execution. With a reconfigurable architecture, this task becomes a two-way process where 
the architecture adapts to the application and vice-vers ... 

Keywords: embedded, real-time systems 


Space-time scheduling of instruction-level parallelism on a raw machine 

Walter Lee, Rajeev Barua, Matthew Frank, Devabhaktuni Srikrishna, Jonathan Babb, Vivek
Sarkar, Saman Amarasinghe

October 1998 ACM SIGPLAN Notices, ACM SIGOPS Operating Systems Review,
Proceedings of the eighth international conference on Architectural
support for programming languages and operating systems ASPLOS-VIII,
Volume 33, 32 Issue 11, 5

Publisher: ACM Press 

Full text available  Additional Information: full citation, abstract, references, citings, index terms

Increasing demand for both greater parallelism and faster clocks dictate that future 
generation architectures will need to decentralize their resources and eliminate primitives 
that require single cycle global communication. A Raw microprocessor distributes all of its 
resources, including instruction streams, register files, memory ports, and ALUs, over a
pipelined two-dimensional mesh interconnect, and exposes them fully to the compiler.
Because communication in Raw machines is distributed, com ...

Exploiting choice: instruction fetch and issue on an implementable simultaneous
multithreading processor

Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, Rebecca L.
Stamm

May 1996 ACM SIGARCH Computer Architecture News, Proceedings of the 23rd
annual international symposium on Computer architecture ISCA '96, volume
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Publisher: ACM Press
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Full text available: pdf (1.48 MB)
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Simultaneous multithreading is a technique that permits multiple independent threads to 
issue multiple instructions each cycle. In previous work we demonstrated the performance 
potential of simultaneous multithreading, based on a somewhat idealized model. In this 
paper we show that the throughput gains from simultaneous multithreading can be 
achieved without extensive changes to a conventional wide-issue superscalar, either in 
hardware structures or sizes. We present an architecture for s ... 

Strategies for achieving improved processor throughput
Matthew K. Farrens, Andrew R. Pleszkun 

April 1991 ACM SIGARCH Computer Architecture News, Proceedings of the 18th
annual international symposium on Computer architecture ISCA '91, volume
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March 1977 ACM Computing Surveys (CSUR), volume 9 issue 1
Publisher: ACM Press
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Polygon rendering on a stream architecture
John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery
August 2000 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS workshop on
Graphics hardware
Publisher: ACM Press 

Full text available: pdf  Additional Information: full citation, abstract, references, citings, index terms

The use of a programmable stream architecture in polygon rendering provides a powerful
mechanism to address the high performance needs of today's complex scenes as well as
the need for flexibility and programmability in the polygon rendering pipeline. We describe
how a polygon rendering pipeline maps into data streams and kernels that operate on
streams, and how this mapping is used to implement the polygon rendering pipeline on
Imagine, a programmable stream processor. We compare our resul ... 

Keywords: OpenGL, SIMD, graphics hardware, kernels, media processors, polygon 
rendering, stream architecture, stream processing, streams 


A survey of processors with explicit multithreading
Theo Ungerer, Borut Robic, Jurij Silc 

March 2003 ACM Computing Surveys (CSUR), volume 35 issue 1
Publisher: ACM Press 

Full text available: pdf (920.16 KB) Additional Information: full citation, abstract, references, citings, index terms

Hardware multithreading is becoming a generally applied technique in the next generation 
of microprocessors. Several multithreaded processors are announced by industry or 
already into production in the areas of high-performance microprocessors, media, and 
network processors. A multithreaded processor is able to pursue two or more threads of
control in parallel within the processor pipeline. The contexts of two or more threads of 
control are often stored in separate on-chip register sets. Unused i ... 

Keywords: Blocked multithreading, interleaved multithreading, simultaneous 
multithreading 


10 On pipelining dynamic instruction scheduling logic
Jared Stark, Mary D. Brown, Yale N. Patt 

December 2000 Proceedings of the 33rd annual ACM/IEEE international symposium 
on Microarchitecture 

Publisher: ACM Press 

Full text available: pdf (128.82 KB), ps (543.84 KB) Additional Information: full citation, references, citings, index terms



http://portal.acm.org/results.cfm?coll=ACM&dl=ACM&CFID=69633554&CFTOK...


4/17/06 


Results (page 1): "data Path" and "register file" and "data element" and therad and "round robin" and pipeli... Page 4 of 6 

Publisher Site


GPGPU: general purpose computation on graphics hardware

David Luebke, Mark Harris, Jens Kruger, Tim Purcell, Naga Govindaraju, Ian Buck, Cliff
Woolley, Aaron Lefohn

August 2004 Proceedings of the conference on SIGGRAPH 2004 course notes GRAPH '04

Publisher: ACM Press

Full text available: pdf (63.03 MB) Additional Information: full citation, abstract

The graphics processor (GPU) on today's commodity video cards has evolved into an 
extremely powerful and flexible processor. The latest graphics architectures provide 
tremendous memory bandwidth and computational horsepower, with fully programmable 
vertex and pixel processing units that support vector operations up to full IEEE floating 
point precision. High level languages have emerged for graphics hardware, making this 
computational power accessible. Architecturally, GPUs are highly parallel s ...

12 Session 17: architecture: Sunder: a programmable hardware prefetch architecture for
numerical loops
Tzi-cker Chiueh 

November 1994 Proceedings of the 1994 ACM/IEEE conference on Supercomputing 
Publisher: ACM Press 

Full text available: pdf (922.38 KB) Additional Information: full citation, abstract, references, citings

Beyond data caching, data prefetching is by far the most effective way to address the 
memory access bottleneck associated with high-performance processors. This is
particularly true for scientific programs whose working sets cannot be easily fit into the 
on-chip data cache. This paper proposes a new data prefetching architecture called 
Sunder, which combines the flexibility and accurateness of software prefetching and the 
transparency and low-overhead of hardware prefetching. Th ... 

13 Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth
Anne Bracy, Prashant Prahlad, Amir Roth 

December 2004 Proceedings of the 37th annual IEEE/ACM International Symposium 
on Microarchitecture MICRO 37 

Publisher: IEEE Computer Society 

Full text available: pdf (189.31 KB) Additional Information: full citation, abstract, citings

A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the 
interface of a singleton instruction: two register inputs, one register output, a maximum 
of one memory operation, and a maximum of one (terminal) control transfer. Previous 
work has exploited dataflow sub-graphs whose execution latency can be reduced via 
programmable FPGA-style hardware. In this paper we show that mini-graphs can improve 
performance by amplifying the bandwidths of a superscalar processor's st ... 

14 Architectures: A programmable vertex shader with fixed-point SIMD datapath for low
power wireless applications
Ju-Ho Sohn, Ramchan Woo, Hoi-Jun Yoo 

August 2004 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on 
Graphics hardware 

Publisher: ACM Press 

Full text available: pdf (427.49 KB) Additional Information: full citation, abstract, references, index terms

The real time 3D graphics becomes one of the attractive applications for 3G wireless 
terminals although their battery lifetime and memory bandwidth limit the system 
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resources for graphics processing. Instead of using the dedicated hardware engine with 
complex functions, we propose an efficient hardware architecture of low power vertex 
shader with programmability. Our architecture includes the following three features: 1) a
fixed-point SIMD datapath to exploit parallelism in vertex process ... 

The M-Machine multicomputer
Marco Fillo, Stephen W. Keckler, William J. Dally, Nicholas P. Carter, Andrew Chang, Yevgeny 
Gurevich, Whay S. Lee 

December 1995 Proceedings of the 28th annual international symposium on
Microarchitecture 

Publisher: IEEE Computer Society Press 

Full text available: pdf (1.29 MB) Additional Information: full citation, references, citings, index terms


Graphics rendering architecture for a high performance desktop workstation
Chandlee B. Harrell, Farhad Fouladi 

September 1993 Proceedings of the 20th annual conference on Computer graphics 
and interactive techniques 

Publisher: ACM Press
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17 Compilation: Cluster assignment of global values for clustered VLIW processors
Andrei Terechko, Erwan Le Thenaff, Henk Corporaal

October 2003 Proceedings of the 2003 international conference on Compilers,
architecture and synthesis for embedded systems

Publisher: ACM Press 

Full text available: pdf (330.94 KB) Additional Information: full citation, abstract, references, index terms

In this paper high-level language (HLL) variables that are alive in a whole HLL function, 
across multiple scheduling units, are termed as global values. Due to their long live 
ranges and, hence, large impact on the schedule, the global values require different
compiler optimizations than local values, which span across only one scheduling unit. The
instruction scheduler for a clustered ILP processor, which is responsible for cluster
assignment of operations and variables, faces a difficult probi ... 

Keywords: ILP, VLIW, cluster assignment, compiler, instruction scheduler, register 
allocation 


18 Ray tracing vs. scan conversion: Comparing Reyes and OpenGL on a stream
architecture

John D. Owens, Brucek Khailany, Brian Towles, William J. Dally 

September 2002 Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on
Graphics hardware 

Publisher: Eurographics Association 

Full text available: pdf (136.72 KB) Additional Information: full citation, abstract, references, citings, index terms

The OpenGL and Reyes rendering pipelines each render complex scenes from similar 
scene descriptions but differ in their internal pipeline organizations. While the OpenGL 
organization has dominated hardware architectures over the past twenty years, a Reyes 
organization differs in several important ways from OpenGL, including a shader coordinate 
system that supports coherent texture accesses, a single shader in the vertex stage, and 
tessellation and sampling instead of triangle rasterization. Hardw ... 
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19 A low-power memory hierarchy for a fully programmable baseband processor 
Wolfgang Raab, Hans-Martin Bluethgen, Ulrich Ramacher 

June 2004 Proceedings of the 3rd workshop on Memory performance issues: in 
conjunction with the 31st international symposium on computer 
architecture WMPI '04 

Publisher: ACM Press 

Full text available: pdf (431.55 KB) Additional Information: full citation, abstract, references, index terms

Future terminals for wireless communication not only must support multiple standards but 
execute several of them concurrently. To meet these requirements, flexibility and ease of 
programming of integrated circuits for digital baseband processing are increasingly 
important criteria for the deployment of such devices, while power consumption and area 
of the devices remain as critical as in the past. The paper presents the architecture of a
fully programmable system-on-chip for digital signal proces ... 

Keywords: baseband processor, low-power memory, memory hierarchy, multi-tasked 
processor, task interleaving
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20 Trident: a scalable architecture for scalar, vector, and matrix operations
Mostafa I. Soliman, Stanislav G. Sedukhin 

January 2002 Australian Computer Science Communications , Proceedings of the 
seventh Asia-Pacific conference on Computer systems architecture - 
Volume 6 CRPITS '02, volume 24 Issue 3 

Publisher: Australian Computer Society, Inc., IEEE Computer Society Press

Full text available: pdf (814.51 KB) Additional Information: full citation, abstract, references, citings, index terms

Within a few years it will be possible to integrate a billion transistors on a single chip. At 
this integration level, we propose using a high level ISA to express parallelism to 
hardware instead of using a huge transistor budget to dynamically extract it. Since the 
fundamental data structures for a wide variety of applications are scalar, vector, and 
matrix, our proposed Trident processor extends the classical vector ISA with matrix 
operations. The Trident processor consists of a set of paralle ... 

Keywords: data parallelism, parallel processing, ring register file, scalable hardware, 
vector/matrix processing 
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